Regional Analysis of Preterm Birth Characteristics

BMIN503/EPID600 Final Project

Author

Rose Albert


1 Interrogating sex as a biological variable in county-level preterm birth inequity

December 13, 2024

By Rose Albert

1.1 Overview

Preterm birth affects approximately 10% of all births, and there is a well-established male vulnerability and female resilience during the neonatal period. The goal of this project is to interrogate sex as a biological variable at the county-level using CDC Wonder Natality Data and identify sex-differences in preterm birth rates and NICU admission. I have spoken with Dr. Aimin Chen and Dr. Heather Burris about biological and structural determinants contributing to preterm birth inequity, as well as potential areas of clinical, public health, and policy intervention.

The data for this project can be found in my final project GitHub repository.

1.2 Introduction

Preterm birth can predispose infants to immediate risks in the neonatal period as well as long-term adverse sequelae. For example, preterm infants are born with underdeveloped lungs and often require therapeutic interventions such as surfactant administration, antibiotics, steroids, mechanical ventilation, and supplemental oxygen for survival. However, these therapeutics can also pose postnatal insults and contribute to lung injury and diseases such as bronchopulmonary dysplasia.

Interrogation of sex as a biological variable has received increasing attention as a critical area for understanding basic pathophysiology and differential response to therapeutics. Male sex is an independent risk factor for numerous pediatric diseases and diseases of prematurity. This project will investigate male susceptibility to preterm birth and NICU admission at the county-level. An understanding of preterm birth characteristics can inform local public health interventions.

1.3 Methods

Data curation and cleaning

County-level Birth Characteristics Data Source: I am using the CDC Wonder Natality Data (2023) grouped by “County of Residence,” “Sex of Infant,” “NICU Admission,” and “OE Gestational Age Recode 10,” with the measures “Births” and “Percent of Total Births.”

Mapping Data Source: I am drawing from Assignment 5 to use geospatial data, tidy census, and leaflet to generate interactive maps.update.packages(“tidycensus”)

#load necessary libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(sf)
Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(tigris)
To enable caching of data, set `options(tigris_use_cache = TRUE)`
in your R script or .Rprofile.
library(tidycensus)
library(leaflet)
options(tigris_use_cache = TRUE)
options(progress_enabled = FALSE)

#Load county-level birth data downloaded from CDC Wonder
birth_data <- read.delim("https://raw.githubusercontent.com/rosemalbert/BMIN503_Final_Project/refs/heads/master/Natality%2C%202016-2023%20expanded_sex.txt")

county_data <- read.delim("https://raw.githubusercontent.com/rosemalbert/BMIN503_Final_Project/refs/heads/master/Natality%2C%202016-2023%20expanded%20_county.txt")

#view column names
colnames(birth_data)
 [1] "Notes"                             "County.of.Residence"              
 [3] "County.of.Residence.Code"          "Sex.of.Infant"                    
 [5] "Sex.of.Infant.Code"                "NICU.Admission"                   
 [7] "NICU.Admission.Code"               "OE.Gestational.Age.Recode.10"     
 [9] "OE.Gestational.Age.Recode.10.Code" "Births"                           
[11] "X..of.Total.Births"               
colnames(county_data)
[1] "Notes"                    "County.of.Residence"     
[3] "County.of.Residence.Code" "Births"                  
[5] "Total.Population"         "Birth.Rate"              
[7] "X..of.Total.Births"      
#merge birth_data and county_data
merged_data <- left_join(birth_data, county_data, by = "County.of.Residence.Code")
Warning in left_join(birth_data, county_data, by = "County.of.Residence.Code"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 11803 of `x` matches multiple rows in `y`.
ℹ Row 250 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
#clean merged_data 
library(dplyr)

# clean merged data by removing unnecessary columns, checking discrepancies, and renaming
cleaned_data <- merged_data %>%
    # Remove 'Notes.x' column
    select(-Notes.x) %>%
    
    # Check for discrepancies between 'County.of.Residence.x' and 'County.of.Residence.y'
    # 
    mutate(
        County.of.Residence = ifelse(
            County.of.Residence.x == County.of.Residence.y, 
            County.of.Residence.x, 
            NA  
        )
    ) %>%
    
    # Merge 'County.of.Residence.x' and 'County.of.Residence.y'
    select(-County.of.Residence.x, -County.of.Residence.y) %>%
    
    # Rename 'Births.y' to 'Total.Births'
    rename(
        Total.Births = Births.y
    ) %>%
    
    # Remove 'Notes.y' column
    select(-Notes.y) %>%
  
    # Remove sex of infant code
    select(-Sex.of.Infant.Code) %>%
    
    # Remove 'X..of.Total.Births.y' and 'X..of.Total.Births.x'
    select(-c(X..of.Total.Births.x, X..of.Total.Births.y)) %>%
  
   # Remove 'OE.Gestational.Age.Recode.10.Code'
    select(-OE.Gestational.Age.Recode.10.Code) %>%
  
   # Remove 'NICU.Admission.Code'
    select(-NICU.Admission.Code) %>%
  
    # Filter out rows where 'NICU.Admission' or 'OE.Gestational.Age.Recode.10' are "Unknown"
    filter(
        !NICU.Admission %in% c("Unknown or Not Stated", ""),
        !OE.Gestational.Age.Recode.10 %in% c("Unknown or Not Stated", "")
    ) %>%
    
    # Remove any duplicates
    distinct()

cleaned_data <- cleaned_data %>%
  mutate(County.of.Residence.Code = as.character(County.of.Residence.Code))

# View cleaned data
head(cleaned_data)
  County.of.Residence.Code Sex.of.Infant NICU.Admission
1                    24033        Female             No
2                    24033          Male             No
3                    24037          Male            Yes
4                    24999        Female             No
5                    24999          Male            Yes
6                    25003        Female             No
  OE.Gestational.Age.Recode.10 Births.x Total.Births Total.Population
1                28 - 31 weeks       10        10527           947430
2             42 weeks or more       10        10527           947430
3                32 - 35 weeks       10         1309           115281
4                28 - 31 weeks       10         5546           550411
5                     40 weeks       10         5546           550411
6                     36 weeks       10          890           126818
  Birth.Rate        County.of.Residence
1      11.11 Prince George's County, MD
2      11.11 Prince George's County, MD
3      11.35      St. Mary's County, MD
4      10.08  Unidentified Counties, MD
5      10.08  Unidentified Counties, MD
6       7.02       Berkshire County, MA
# Convert 'County.of.Residence.Code' to character before merging
preterm_rate_data <- cleaned_data %>%
   mutate(
        Is.Preterm = OE.Gestational.Age.Recode.10 %in% c("28 - 31 weeks", "32 - 35 weeks", "36 weeks", "20 - 27 weeks", "Under 20 weeks"),
        County.of.Residence.Code = as.character(County.of.Residence.Code)  # Convert to character
    ) %>%
   group_by(County.of.Residence, County.of.Residence.Code) %>%  # Keep 'County.of.Residence.Code' in the group_by
    summarise(
        Total.Births = sum(Total.Births, na.rm = TRUE),
        Preterm.Births = sum(ifelse(Is.Preterm, Births.x, 0), na.rm = TRUE),
        Preterm.Birth.Rate = (Preterm.Births / Total.Births) * 100
    ) %>%
   ungroup()
`summarise()` has grouped output by 'County.of.Residence'. You can override
using the `.groups` argument.
# View the result to ensure 'County.of.Residence.Code' is retained
head(preterm_rate_data)
# A tibble: 6 × 5
  County.of.Residence County.of.Residence.Code Total.Births Preterm.Births
  <chr>               <chr>                           <int>          <dbl>
1 Ada County, ID      16001                          122064            408
2 Adams County, CO    8001                           139436            566
3 Adams County, PA    42001                            9130             26
4 Aiken County, SC    45003                           26698            169
5 Alachua County, FL  12001                           48317            266
6 Alamance County, NC 37001                           28320            158
# ℹ 1 more variable: Preterm.Birth.Rate <dbl>
#load census shapefile data and merge by county FIPs 
counties_sf <- counties(cb = TRUE, resolution = "20m") # cb = TRUE for a simplified geometry
Retrieving data for the year 2022
# Merge the cleaned data with the counties shapefile
county_map_preterm_data <- left_join(counties_sf, preterm_rate_data, by = c("GEOID" = "County.of.Residence.Code"))

county_map_cleaned_data <- left_join(counties_sf, cleaned_data, by = c("GEOID" = "County.of.Residence.Code"))

# Integrate with US Census Data

Analysis plan

Exploratory analysis of data: Generate county-level choropleth maps of birth rates ((births/total population)*1000 for each county). Generate county-level choropleth maps of preterm birth rates (<37 gestational age births/total population). Generate sex-stratified maps of preterm birth rates. Generate sex-stratified choropleth maps of NICU admissions by county. Generate bar graphs showing NICU admission rates by sex and sex distribution of gestational age groups.

Data analysis: Conduct 2-way ANOVA to determine if sex as a biological variable is a significant predictor of NICU admission. Identify counties with highest and lowest preterm birth rates, and the highest and lowest NICU admission rates.

1.4 Results

First, I developed a county-level choropleth maps of birth rates (live births per 1000 people) for counties that did not have any missing data. The below map represents 496 counties. Rockland County, NY has the highest birth rate (19.4 per 1000 people) while Charlotte County, FL has the lowest birth rate (4.99 per 1000 people).

#load necessary libraries
library(tidyverse)
library(sf)
library(tidycensus)
library(ggspatial)
library(leaflet)
library(cowplot)

Attaching package: 'cowplot'
The following object is masked from 'package:lubridate':

    stamp
library(nhanesA)
library(haven)
library(viridis)
Loading required package: viridisLite
# Convert Birth.Rate to numeric (if necessary)
county_map_cleaned_data$Birth.Rate <- as.numeric(county_map_cleaned_data$Birth.Rate)

# Recheck the structure of the column
str(county_map_cleaned_data$Birth.Rate)
 num [1:11636] NA NA NA NA NA ...
# Remove rows where Birth.Rate is NA
county_map_cleaned_data <- county_map_cleaned_data %>%
  filter(!is.na(Birth.Rate))

#load map theme
map_theme <- function() {
  theme_minimal() +  
    theme(
      axis.line = element_blank(), 
      axis.text = element_blank(),  
      axis.title = element_blank(),
      panel.grid = element_line(color = "white"), 
      legend.key.size = unit(0.8, "cm"),     
      legend.text = element_text(size = 16),   
      legend.title = element_text(size = 16)
    )
}

#create my palette
myPalette <- colorRampPalette(viridis(100))

# Generate the choropleth map for birth rates
birth_rates_map <- ggplot(data = county_map_cleaned_data) +
  geom_sf(aes(fill = Birth.Rate)) +  # Map Preterm Birth Rate to the fill
  map_theme() +  # Apply the custom theme
  ggtitle("2023 Birth Rate (live births per 1000 people)") +
  scale_fill_gradientn(name = "Birth Rate\nrate", colours = myPalette(100))
  coord_sf(xlim = c(-79.5, -75), ylim = c(37, 39))
<ggproto object: Class CoordSf, CoordCartesian, Coord, gg>
    aspect: function
    backtransform_range: function
    clip: on
    crs: NULL
    datum: crs
    default: FALSE
    default_crs: NULL
    determine_crs: function
    distance: function
    expand: TRUE
    fixup_graticule_labels: function
    get_default_crs: function
    is_free: function
    is_linear: function
    label_axes: list
    label_graticule: 
    labels: function
    limits: list
    lims_method: cross
    modify_scales: function
    ndiscr: 100
    params: list
    range: function
    record_bbox: function
    render_axis_h: function
    render_axis_v: function
    render_bg: function
    render_fg: function
    setup_data: function
    setup_layout: function
    setup_panel_guides: function
    setup_panel_params: function
    setup_params: function
    train_panel_guides: function
    transform: function
    super:  <ggproto object: Class CoordSf, CoordCartesian, Coord, gg>
# Print the map
print(birth_rates_map)

#How many counties are represented in this analysis?
num_unique_geoid_counties <- length(unique(county_map_cleaned_data$GEOID))
print(num_unique_geoid_counties)
[1] 496
#Which counties have the highest and lowest birth rates?
# Find the county with the highest Birth Rate and its value (show only the first entry)
highest_birth_rate <- county_map_cleaned_data %>%
  filter(Birth.Rate == max(Birth.Rate, na.rm = TRUE)) %>%
  slice(1)  # Select the first row

cat("County with the highest Birth Rate: ", highest_birth_rate$County.of.Residence, "\n")
County with the highest Birth Rate:  Rockland County, NY 
cat("Highest Birth Rate: ", highest_birth_rate$Birth.Rate, "\n")
Highest Birth Rate:  19.4 
# Find the county with the lowest Birth Rate and its value (show only the first entry)
lowest_birth_rate <- county_map_cleaned_data %>%
  filter(Birth.Rate == min(Birth.Rate, na.rm = TRUE)) %>%
  slice(1)  # Select the first row

cat("County with the lowest Birth Rate: ", lowest_birth_rate$County.of.Residence, "\n")
County with the lowest Birth Rate:  Charlotte County, FL 
cat("Lowest Birth Rate: ", lowest_birth_rate$Birth.Rate, "\n")
Lowest Birth Rate:  4.99 

To better visualize and explore the birth rates, I then generated an interactive choropleth map of birth rates by county.

#generate interactive choropleth map 
library(leaflet)
library(RColorBrewer)

# Define the color scale for the Birth Rate using colorBin
pal_fun <- colorBin(palette = brewer.pal(9, "YlOrRd")[c(1:5, 7)], 
                    bins = c(0, 5, 10, 15, 20, 25), 
                    reverse = FALSE)

# Create the popup message for each county
pu_message <- paste0(county_map_cleaned_data$County.of.Residence, 
                     "<br>Birth Rate: ",      
                     round(county_map_cleaned_data$Birth.Rate, 1), " live births per 1000 people")

# Create the leaflet map with the polygons
leaflet(county_map_cleaned_data) |>
  addPolygons(stroke = FALSE,               
              fillColor = ~pal_fun(Birth.Rate),
              fillOpacity = 0.7, smoothFactor = 0.5,
              popup = pu_message) |>   
  addProviderTiles(providers$CartoDB.Positron) |>   
  addLegend("bottomright",               
            pal = pal_fun,              
            values = ~Birth.Rate,  
            title = 'Birth Rate',    
            opacity = 1) |>             
  addScaleBar()
Warning: sf layer has inconsistent datum (+proj=longlat +datum=NAD83 +no_defs).
Need '+proj=longlat +datum=WGS84'

I repeated the analysis to generate county-level choropleth maps of preterm birth rates (<37 gestational age births/total number of live births).

# Convert Preterm.Birth.Rate to numeric
county_map_preterm_data$Preterm.Birth.Rate <- as.numeric(county_map_preterm_data$Preterm.Birth.Rate)

# Recheck the structure of the column
str(county_map_preterm_data$Preterm.Birth.Rate)
 num [1:3222] NA NA NA NA NA ...
# Remove rows where Birth.Rate is NA
county_map_preterm_data <- county_map_preterm_data %>%
  filter(!is.na(Preterm.Birth.Rate))

# Generate the choropleth map for preterm birth rates
pretetrm_birth_rates_map <- ggplot(data = county_map_preterm_data) +
  geom_sf(aes(fill = Preterm.Birth.Rate)) +  # Map Preterm Birth Rate to the fill
  map_theme() +  # Apply the custom theme
  ggtitle("2023 Preterm Birth Rate % (<37 gestational age births/total number of live births)") +
  scale_fill_gradientn(name = "Preterm Birth Rate\nrate", colours = myPalette(100))
  coord_sf(xlim = c(-79.5, -75), ylim = c(37, 39))
<ggproto object: Class CoordSf, CoordCartesian, Coord, gg>
    aspect: function
    backtransform_range: function
    clip: on
    crs: NULL
    datum: crs
    default: FALSE
    default_crs: NULL
    determine_crs: function
    distance: function
    expand: TRUE
    fixup_graticule_labels: function
    get_default_crs: function
    is_free: function
    is_linear: function
    label_axes: list
    label_graticule: 
    labels: function
    limits: list
    lims_method: cross
    modify_scales: function
    ndiscr: 100
    params: list
    range: function
    record_bbox: function
    render_axis_h: function
    render_axis_v: function
    render_bg: function
    render_fg: function
    setup_data: function
    setup_layout: function
    setup_panel_guides: function
    setup_panel_params: function
    setup_params: function
    train_panel_guides: function
    transform: function
    super:  <ggproto object: Class CoordSf, CoordCartesian, Coord, gg>
pretetrm_birth_rates_map

#How many counties are represented in this analysis?
num_unique_geoid_counties_preterm <- length(unique(county_map_preterm_data$GEOID))
print(num_unique_geoid_counties_preterm)
[1] 496
#Which counties have the highest and lowest preterm birth rates?
# Find the county with the highest preterm Birth Rate and its value (show only the first entry)
highest_preterm_birth_rate <- county_map_preterm_data %>%
  filter(Preterm.Birth.Rate == max(Preterm.Birth.Rate, na.rm = TRUE)) %>%
  slice(1)  # Select the first row

cat("County with the highest preterm Birth Rate: ", highest_preterm_birth_rate$County.of.Residence, "\n")
County with the highest preterm Birth Rate:  Kanawha County, WV 
cat("Highest preterm Birth Rate: ", highest_preterm_birth_rate$Preterm.Birth.Rate, "\n")
Highest preterm Birth Rate:  1.178862 
# Find the county with the lowest preterm Birth Rate and its value (show only the first entry)
lowest_preterm_birth_rate <- county_map_preterm_data %>%
  filter(Preterm.Birth.Rate == min(Preterm.Birth.Rate, na.rm = TRUE)) %>%
  slice(1)  # Select the first row

cat("County with the lowest preterm Birth Rate: ", lowest_preterm_birth_rate$County.of.Residence, "\n")
County with the lowest preterm Birth Rate:  Ocean County, NJ 
cat("Lowest preterm Birth Rate: ", lowest_preterm_birth_rate$Preterm.Birth.Rate, "\n")
Lowest preterm Birth Rate:  0.2631598 

Describe your results and include relevant tables, plots, and code/comments used to obtain them. You may refer to the Section 1.3 as needed. End with a brief conclusion of your findings related to the question you set out to address. You can include references if you’d like, but this is not required.

1.5 Conclusion

This the conclusion. The Section 1.4 can be invoked here.